Insurance Dataset¶

Viewing recommendations¶

For the best experience, open the pre-rendered .html file

You have 2 options depending on how much you trust my code and how little time you want to dedicate to this review :)

  1. ✅ open the pre-rendered HTML file, which contains all of the data without any of the rendering limitations that GitHub suffers from
    • Best if you just want to review the analysis and conclusions
  2. ✅ open the notebook in JupyterLab (instead of GitHub) and re-run all cells
    • Best if you want to verify that the code works and that the data is not fake

Why you should NOT view this notebook on GitHub

Some charts/diagrams/features are not visible on GitHub.

  • This impacts all:
    • plotly plots
    • folium maps
    • embedded images
    • and all other dynamic content
    • Quarto meta-tags that improve visualization

This is standard and well-known behaviour.

  • While some workarounds are possible, there is no universal fix that resolves all of these issues.
    • Rendering Plotly output requires installing external tools, and that still doesn't fix the other issues
    • Standard workarounds (nbviewer) do not work because the GitHub repo is private.

Requirements for running this locally

  • you will need Python 3.9
  • you will need to install the dependencies using pip install -r requirements.txt before proceeding.

Context¶

For this dataset we want to explore different ways to predict a specific column ("travel insurance") to assist our fictitious insurance company.

A few important things to remember about the dataset we are going to be using:

  • We do not know if this data is real, or if it's made up.
  • The demographic represented is:
    • Young adults (25 to 35 yo)
    • Travellers
    • Employed
    • Mostly graduated
  • Money amounts are in ₹ (INR: Indian Rupees)

Based on all these attributes, we are trying to predict which customers decided to buy travel insurance.

The dataset is relatively small (2k rows) because we are a small travel agency.

Performance Metrics¶

Since the number of people who book their vacation through us is fairly small, we want to maximize our chances of selling travel insurance policies. This is a critical revenue stream for us and we cannot afford to leave money on the table.

Prioritizing recall scores¶

This is why we will be looking at model performance and, in particular, trying to find models with good recall.

We want the model to err on the side of "this person is likely to buy, suggest it to them" instead of "I am only flagging customers that I know for sure will buy this insurance".

One simple (non-ML) policy would be "suggest it to everyone", but that would be wasteful: we don't want to pitch every single customer and earn a reputation as an aggressive travel agency. We do, however, want to end up with a model that has strong recall.

The ideal situation for us would be:


|  | did not offer insurance | offered insurance |
|---|---|---|
| did not want insurance | 😐 no one cares | 🤷 okay... |
| wanted insurance | ⚠️ TERRIBLE (worst mistake you could have ever made) | ✅ GREAT |

So we want models with perfect true positives and, more importantly, no false negatives.

The other two possibilities (true negatives and false positives) don't really affect us.
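In confusion-matrix terms, recall = TP / (TP + FN), so driving false negatives to zero is the same as driving recall to 1, at the cost of extra false positives. A tiny illustration with toy labels (not our dataset):

```python
from sklearn.metrics import confusion_matrix, recall_score, precision_score

# Toy labels: 1 = "wanted insurance", 0 = "did not want insurance"
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]  # offer aggressively: no missed buyers

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 2 2 0 4
print(recall_score(y_true, y_pred))     # 1.0: zero false negatives
print(precision_score(y_true, y_pred))  # ~0.67: we pay with some wasted offers
```

This is the trade-off we will keep revisiting: the aggressive predictor above never misses a buyer, but only two thirds of its offers land.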

Ideal result for our model¶


|  | did not offer insurance | offered insurance |
|---|---|---|
| did not want insurance | ? | ? |
| wanted insurance | must be 0 | as high as possible |

We will try to create a custom loss function that prioritizes this goal. Let's see how that goes.

Exploratory Analysis¶

In [87]:
import ipywidgets as widgets
from IPython.display import display, Markdown, Image, clear_output, HTML
import matplotlib.pyplot as plt
import matplotlib.animation as animation

import duckdb
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from tqdm import tqdm
import time
from collections import deque
from random import random, seed
import sqlite3 as lite
import logging
import warnings
import ydata_profiling
import iplantuml
import xml.dom.minidom

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    recall_score,
    precision_score,
)
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import plot_tree
from sklearn.metrics import make_scorer, confusion_matrix

from utils import *
from utils import __
from insurance import *

from autofeat import AutoFeatRegressor as AFR
from sklearn.model_selection import train_test_split
from sklearn.exceptions import DataConversionWarning

seed(100)
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 50
util.check("done")
loading insurance modules... ✅ completed
configuring autoreload... ✅ completed
✅

Let's use black to auto-format all our cells so they adhere to PEP 8

In [2]:
import lab_black

%reload_ext lab_black
util.patch_nb_black()
# fmt: off
# fmt: on
In [3]:
logger = util.configure_logging(jupyterlab_level=logging.WARN, file_level=logging.DEBUG)

warnings.filterwarnings("ignore", category=FutureWarning)

Fetching and loading the dataset¶

We're ready to start! Let's download the dataset from Kaggle.

In [4]:
dataset_name = "tejashvi14/travel-insurance-prediction-data"
db_filename = "TravelInsurancePrediction.csv"

auto_kaggle.download_dataset(dataset_name, db_filename)
__
Kaggle API 1.5.13 - login as 'edualmas'
File [dataset/TravelInsurancePrediction.csv] already exists locally!
No need to re-download dataset [tejashvi14/travel-insurance-prediction-data]
In [5]:
raw_ = auto_pandas.from_csv(f"dataset/{db_filename}", read_csv_kws={"index_col": 0})
In [6]:
raw_.head()
Out[6]:
age employment type graduateornot annualincome familymembers chronicdiseases frequentflyer evertravelledabroad travelinsurance
0 31 Government Sector Yes 400000 6 1 No No 0
1 31 Private Sector/Self Employed Yes 1250000 7 0 No No 0
2 34 Private Sector/Self Employed Yes 500000 4 1 No No 1
3 28 Private Sector/Self Employed Yes 700000 3 1 No No 0
4 28 Private Sector/Self Employed Yes 700000 8 1 Yes No 0
In [7]:
def bool_to_num(df: pd.DataFrame, col: str) -> pd.DataFrame:
    df[f"{col}_number"] = -1
    df.loc[df[col] == "Yes", f"{col}_number"] = 1
    df.loc[df[col] == "No", f"{col}_number"] = 0
    df = df.drop(columns=[col])
    df = df.rename(columns={f"{col}_number": col})
    return df



clean = raw_.copy()

clean = bool_to_num(clean, "graduateornot")
clean = bool_to_num(clean, "frequentflyer")
clean = bool_to_num(clean, "evertravelledabroad")
clean = pd.get_dummies(clean, drop_first=True, columns=["employment type"])
clean = clean.rename(
    columns={"employment type_Private Sector/Self Employed": "employment_private"}
)

clean.head()
Out[7]:
age annualincome familymembers chronicdiseases travelinsurance graduateornot frequentflyer evertravelledabroad employment_private
0 31 400000 6 1 0 1 0 0 0
1 31 1250000 7 0 0 1 0 0 1
2 34 500000 4 1 1 1 0 0 1
3 28 700000 3 1 0 1 0 0 1
4 28 700000 8 1 0 1 1 0 1

Let's check that all of the values were as expected:

In [8]:
cols = ["graduateornot", "frequentflyer", "evertravelledabroad"]
print("amount of unknowns in each of the classified cols:")
for col in cols:
    print(col, clean[clean[col] == -1].shape[0])
amount of unknowns in each of the classified cols:
graduateornot 0
frequentflyer 0
evertravelledabroad 0
In [9]:
clean = clean.rename(
    columns={
        "annualincome": "income_rupees",
        "familymembers": "family_size",
        "chronicdiseases": "has_chronic_disease",
        "travelinsurance": "travel_insurance",
        "graduateornot": "has_graduated",
        "frequentflyer": "is_frequent_flyer",
        "evertravelledabroad": "has_travelled_abroad",
    }
)

Let's reorder the columns so the predicted variable is at the end of the df

In [10]:
clean.columns = [
    "age",
    "income_rupees",
    "family_size",
    "has_chronic_disease",
    "has_graduated",
    "is_frequent_flyer",
    "has_travelled_abroad",
    "employment_private",
    "travel_insurance",
]
clean
Out[10]:
age income_rupees family_size has_chronic_disease has_graduated is_frequent_flyer has_travelled_abroad employment_private travel_insurance
0 31 400000 6 1 0 1 0 0 0
1 31 1250000 7 0 0 1 0 0 1
2 34 500000 4 1 1 1 0 0 1
3 28 700000 3 1 0 1 0 0 1
4 28 700000 8 1 0 1 1 0 1
... ... ... ... ... ... ... ... ... ...
1982 33 1500000 4 0 1 1 1 1 1
1983 28 1750000 5 1 0 1 0 1 1
1984 28 1150000 6 1 0 1 0 0 1
1985 34 1000000 6 0 1 1 1 1 1
1986 34 500000 4 0 0 1 0 0 1

1987 rows × 9 columns

Let's also convert the monetary amounts to EUR, to get a better sense of the magnitudes involved

In [11]:
# correct as of 2023/05/14
rupees_to_eur = 0.01118

clean["income_eur"] = clean["income_rupees"] * rupees_to_eur
sns.histplot(data=clean, x="income_eur", hue="travel_insurance", multiple="stack")
plt.title("annual income distribution in EUR")
plt.xlabel("income (converted to eur)")
plt.xlim(0)
clean = clean.drop(columns=["income_eur"])
In [12]:
clean.describe().T
Out[12]:
count mean std min 25% 50% 75% max
age 1987.0 29.650226 2.913308 25.0 28.0 29.0 32.0 35.0
income_rupees 1987.0 932762.959235 376855.684748 300000.0 600000.0 900000.0 1250000.0 1800000.0
family_size 1987.0 4.752894 1.60965 2.0 4.0 5.0 6.0 9.0
has_chronic_disease 1987.0 0.277806 0.44803 0.0 0.0 0.0 1.0 1.0
has_graduated 1987.0 0.357323 0.479332 0.0 0.0 0.0 1.0 1.0
is_frequent_flyer 1987.0 0.851535 0.35565 0.0 1.0 1.0 1.0 1.0
has_travelled_abroad 1987.0 0.209864 0.407314 0.0 0.0 0.0 0.0 1.0
employment_private 1987.0 0.191243 0.393379 0.0 0.0 0.0 0.0 1.0
travel_insurance 1987.0 0.713135 0.452412 0.0 0.0 1.0 1.0 1.0
In [13]:
@run
@cached_chart()
def pairplot_with_corrs():
    f = plt.figure(figsize=(8, 8))
    return charts.pairplot_with_corr(clean)
Loading from cache [./cached/charts/pairplot_with_corrs.png]

This is the discovery of the century: a large annual income is strongly correlated with:

  • travelling
    • and travelling abroad
    • and buying travel insurance
  • working in the private sector
In [14]:
clean.corr()
Out[14]:
age income_rupees family_size has_chronic_disease has_graduated is_frequent_flyer has_travelled_abroad employment_private travel_insurance
age 1.000000 -0.020101 0.027409 0.007359 0.061060 0.027125 -0.033159 -0.012779 -0.115134
income_rupees -0.020101 1.000000 -0.015367 -0.001149 0.396763 0.108066 0.353087 0.486043 0.349157
family_size 0.027409 -0.015367 1.000000 0.028209 0.079909 0.021201 -0.023775 -0.020755 -0.003354
has_chronic_disease 0.007359 -0.001149 0.028209 1.000000 0.018190 0.018811 -0.043720 0.021238 -0.011553
has_graduated 0.061060 0.396763 0.079909 0.018190 1.000000 0.018934 0.232103 0.433183 0.147847
is_frequent_flyer 0.027125 0.108066 0.021201 0.018811 0.018934 1.000000 -0.028120 0.062683 -0.127133
has_travelled_abroad -0.033159 0.353087 -0.023775 -0.043720 0.232103 -0.028120 1.000000 0.277334 0.143790
employment_private -0.012779 0.486043 -0.020755 0.021238 0.433183 0.062683 0.277334 1.000000 0.181098
travel_insurance -0.115134 0.349157 -0.003354 -0.011553 0.147847 -0.127133 0.143790 0.181098 1.000000
In [15]:
@run
@cached_profile_report()
def profile_clean_data():
    return ydata_utils.report(clean)
Loading from cache [./cached/profile_report/profile_clean_data.html]

This dataset is looking really good:

  • no missing data
  • no obvious outliers

I think it's time to start modelling.

Predicting outcomes using machine learning¶

For this project we want to predict which clients are likely to purchase travel insurance.

Let's find the clients who will not be able to refuse our offers.

Here's a summary of the available algorithms, extracted directly from this fantastic talk by Brian Lange at PyData 2016

| Included in this analysis | Name | non-linear | probability estimate | feature importance | updatable | parallelizable |
|---|---|---|---|---|---|---|
| ❌ | Logistic Regression | No | Yes | Yes, if you scale | maybe | maybe |
| ✅ | SVM | Yes, with kernel | No | No | depending on kernel | only for some kernels |
| ✅ | KNN | Yes | Yes, % of nearby points | No | yes | yes |
| ❌ | Naive Bayes | yes | yes | no | yes | yes |
| ✅ | Decision Trees | yes | no | kinda, number of times a feature is used | no | no (but it's very fast) |
| ✅ | Ensemble models (Random Forests, Bagged KNNs) | yes | kinda (% of models that agree) | kinda, depending on component parts (which models you chose) | kinda, by adding new models to the ensemble | yes |
| ❌ | Boosted models | yes | kinda (% of models that agree) | kinda, depending on component parts (which models you chose) | kinda, by adding new models to the ensemble | no |

KNNs¶

The first model we want to try out is K-Nearest Neighbors.

In [16]:
performance: ModelPerformance = knn(clean, k=1, check_using="train")
In [ ]:
performance.confusion_plot()

Not bad for such a simple and naive model. Let's see if the performance improves as we increase k

We don't need to worry about the train/test split; each of these functions takes care of it internally. They all use a fixed random seed to guarantee repeatable results.
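The helper functions themselves aren't shown in this notebook, but internally each one does something along these lines (a hypothetical sketch, not the actual knn helper from the imported insurance/utils modules):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score


def knn_sketch(df: pd.DataFrame, k: int = 5, check_using: str = "test") -> float:
    """Train a KNN on df and return recall on the requested split."""
    X = df.drop(columns="travel_insurance")
    y = df["travel_insurance"]
    # Fixed random_state: every helper sees the exact same split,
    # so results are repeatable across cells.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    X_ev, y_ev = (X_tr, y_tr) if check_using == "train" else (X_te, y_te)
    return recall_score(y_ev, model.predict(X_ev))
```

Because the split is seeded, calling the sketch twice with the same arguments returns the same score.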

Performance¶

In [21]:
df = clean

Let's compare KNN performance for different values of K, and for each of the 4 algorithms it supports.

We will compare performance differences and behaviour in two scenarios:

  • checking against train
    • checking against the test dataset (we would not normally do this in the real world, but we want to explore how these models work under the hood)
In [148]:
ks = range(1, 100, 5)
algos = ["auto", "ball_tree", "kd_tree", "brute"]
datasets = ["train", "test"]

m = [knn(df, k, algo, dataset) for k in ks for algo in algos for dataset in datasets]
metrics_knn = metrics_as_dataframe(m)
metrics_knn
Out[148]:
check_using K algorithm metric score
0 train 1 auto accuracy 0.976715
1 test 1 auto accuracy 0.934673
2 train 1 ball_tree accuracy 0.976715
3 test 1 ball_tree accuracy 0.934673
4 train 1 kd_tree accuracy 0.976715
... ... ... ... ... ...
635 test 96 ball_tree f1_score 0.869427
636 train 96 kd_tree f1_score 0.864128
637 test 96 kd_tree f1_score 0.869427
638 train 96 brute f1_score 0.864128
639 test 96 brute f1_score 0.869427

640 rows × 5 columns

In [147]:
grid = sns.FacetGrid(data=metrics_knn, row="algorithm", col="metric", ylim=(0.5, 1))
grid.map_dataframe(sns.lineplot, x="K", y="score", hue="check_using")
grid.add_legend()
Out[147]:
<seaborn.axisgrid.FacetGrid at 0x200bbcb6a00>

I doubt that KNN is ever used with K > 20, for datasets like this one, but I still wanted to explore how performance degrades as K approaches the size of the entire dataset.

As expected, the best performance appears:

  • for small numbers of K
  • for odd numbers of K (so tie-breaking is not an issue) - recall seems to be particularly sensitive to this.

The last interesting thing is that the choice of algorithm does not seem to affect our results in any significant way.

Decision Trees¶

Let's move on to something more interesting: Decision Trees

The main parameter for trees is the maximum depth allowed (how many decisions in a row they may make). Too many decisions will likely lead to overfitting, while too few will lead to underfitting and leave potential performance on the table.
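As a reference for what the decision_tree helper is presumably doing under the hood, here is a plain-sklearn sketch of the depth trade-off on a synthetic dataset (hypothetical; not the notebook's actual helper):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the insurance data: 500 rows, 8 features.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

for depth in (1, 3, 15):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # A growing gap between train and test accuracy at high depth
    # is the signature of overfitting.
    print(f"depth={depth}  train={tree.score(X_tr, y_tr):.3f}  "
          f"test={tree.score(X_te, y_te):.3f}")
```

At depth 15 the tree can memorise the training set, so its train score approaches 1.0 while the test score stalls.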

In [154]:
performance: ModelPerformance = decision_tree(df, depth=3)

performance.confusion_plot()

Not bad for a depth of 3.

Let's take a look at the type of decisions this tree is making to classify our data

In [24]:
plt.figure(figsize=(20, 6))
plot_tree(
    performance.extra["model"],
    filled=True,
    feature_names=df.drop(columns="travel_insurance").columns,
)
plt.show()

Just from a quick look we can see that income is a major factor in this decision, since it appears in 3 different nodes.

Age is the next most influential attribute.

Performance¶

In [151]:
depth = range(1, 20)
datasets = ["train", "test"]

m = [decision_tree(df, d, check_using=c) for d in depth for c in datasets]
metrics_tree = metrics_as_dataframe(m)
metrics_tree
Out[151]:
check_using depth metric score
0 train 1 accuracy 0.790434
1 test 1 accuracy 0.821608
2 train 2 accuracy 0.795469
3 test 2 accuracy 0.826633
4 train 3 accuracy 0.796098
... ... ... ... ...
147 test 17 f1_score 0.970537
148 train 18 f1_score 0.986192
149 test 18 f1_score 0.970537
150 train 19 f1_score 0.986192
151 test 19 f1_score 0.970537

152 rows × 4 columns

In [153]:
grid = sns.FacetGrid(data=metrics_tree, col="metric", ylim=(0.5, 1))
grid.map_dataframe(sns.lineplot, x="depth", y="score", hue="check_using")
grid.add_legend()
Out[153]:
<seaborn.axisgrid.FacetGrid at 0x200bdaaea00>

It seems that accuracy does improve a bit as we allow more depth, but after a few layers it flattens out (as expected).

Again, just to reiterate: in production we would never do what we are doing here (checking against the test dataset to understand/tune hyperparameters). This analysis is only meant to show how performance changes with different parameter combinations, and these charts show clear evidence of overfitting once the depth goes beyond 6 or 7: training performance keeps improving beyond what the test dataset supports. This can be seen in 3 of the 4 metrics inspected.

Random Forests¶

In [155]:
performance: ModelPerformance = random_forest(df, max_depth=3, n_estimators=100)

performance.confusion_plot()

Let's interrogate the random forest and try to extract the "consensus" around the importance of the features

In [29]:
importances = performance.extra["model"].feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = df.columns[indices]

plt.figure()
plt.title("Feature importance")
plt.bar(range(df.shape[1] - 1), importances[indices], color="r", align="center")
plt.xticks(range(df.shape[1] - 1), feature_names, rotation=90)
plt.xlim([-1, df.shape[1] - 1])
plt.show()

Performance¶

Oops! During our initial "let's try combinations" phase, things got out of hand (no need to blame any specific data developer!)

We need to be careful with caching hundreds of random forests!

def random_forests():
    return {
        (size, depth): random_forest(df, max_depth=depth, n_estimators=size)
        for depth in range(1, 20)
        for size in range(50, 1000, 50)
    }

This resulted in 361 different random forests being cached to disk, occupying more than 3 GB of space

Our cache on disk is not too pleased!

Our memory and swap space are also displeased about what we just did, and voiced their opinion -loudly- against doing it again.

Let's try again, but with fewer data points.

In [187]:
tree_depths = range(1, 8)
forest_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 60, 120]
datasets = ["train", "test"]
In [170]:
m = [
    random_forest(df, max_depth=depth, n_estimators=size, check_using=c)
    for depth in tree_depths
    for size in forest_sizes
    for c in datasets
]
metrics_forest = metrics_as_dataframe(m)
metrics_forest
Out[170]:
max_depth check_using n_estimators metric score
0 1 train 1 accuracy 0.790434
1 1 test 1 accuracy 0.821608
2 1 train 2 accuracy 0.790434
3 1 test 2 accuracy 0.821608
4 1 train 3 accuracy 0.785400
... ... ... ... ... ...
723 7 test 30 f1_score 0.939444
724 7 train 60 f1_score 0.944842
725 7 test 60 f1_score 0.940789
726 7 train 120 f1_score 0.942857
727 7 test 120 f1_score 0.939245

728 rows × 5 columns

In [176]:
grid = sns.FacetGrid(data=metrics_forest, col="max_depth", row="metric", ylim=(0.5, 1))
grid.map_dataframe(sns.lineplot, x="n_estimators", y="score", hue="check_using")
grid.add_legend()
Out[176]:
<seaborn.axisgrid.FacetGrid at 0x2010eff4340>

We can see that max tree depth does seem to influence accuracy and recall in a predictable way (more tree complexity = better performance), but it is forest size that most consistently guarantees performance.

In [197]:
m = metrics_forest[metrics_forest.check_using == "train"].drop(columns=["check_using"])
accuracy = m[m.metric == "accuracy"].drop(columns=["metric"])
precision = m[m.metric == "precision"].drop(columns=["metric"])
recall = m[m.metric == "recall"].drop(columns=["metric"])
f1 = m[m.metric == "f1_score"].drop(columns=["metric"])
In [196]:
fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(1, 3, 1, projection="3d")
ax2 = fig.add_subplot(1, 3, 2, projection="3d")
ax3 = fig.add_subplot(1, 3, 3, projection="3d")

X, Y = np.meshgrid(tree_depths, forest_sizes)
acc = accuracy["score"].to_numpy().reshape(len(forest_sizes), len(tree_depths))
pre = precision["score"].to_numpy().reshape(len(forest_sizes), len(tree_depths))
rec = recall["score"].to_numpy().reshape(len(forest_sizes), len(tree_depths))


def render_frame(i):
    ax1.clear()
    ax1.set_title(f"Accuracy")
    ax1.contour3D(X, Y, acc, 100, cmap=sns.color_palette("flare", as_cmap=True))
    ax1.set_xlabel("tree max depth")
    ax1.set_ylabel("forest_size")
    ax1.view_init(20, i * 12)
    ax1.set_zlim(0.80, 1)

    ax2.clear()
    ax2.set_title(f"Precision")
    ax2.contour3D(X, Y, pre, 100, cmap=sns.color_palette("flare", as_cmap=True))
    ax2.set_xlabel("tree max depth")
    ax2.set_ylabel("forest_size")
    ax2.view_init(20, i * 12)
    ax2.set_zlim(0.80, 1)

    ax3.clear()
    ax3.set_title(f"Recall")
    ax3.contour3D(X, Y, rec, 100, cmap=sns.color_palette("flare", as_cmap=True))
    ax3.set_xlabel("tree max depth")
    ax3.set_ylabel("forest_size")
    ax3.view_init(20, i * 12)
    ax3.set_zlim(0.80, 1)


ani = animation.FuncAnimation(fig, render_frame, frames=30)
HTML(ani.to_jshtml())
Out[196]:

Based on this analysis, we can conclude that while max tree depth does somewhat affect performance, the factor that really influences accuracy, precision and recall is forest size (up to a certain point).

The ideal spot seems to be: large enough to maximize performance without wasting computing resources.

The other thing that might be important to monitor is the precision/recall balance... we can see that at a forest size of 600, precision is still improving while recall is degrading (probably due to overfitting?).

SVM - Support Vector Machines¶

In [204]:
performance: ModelPerformance = svm(clean, kernel="poly", degree=2, scale=False)

performance.confusion_plot()

Dear Sample SVM,

We hereby crown you "Senior Marketing Director in charge of Sales"

Congrats on the well deserved promotion!

Best Regards,

The rest of the company

Okay... This is just a sample of a random SVM, but we do want to explore how they perform in general.

For this exploration of SVMs, we also want to capture duration (model initialization + training + test) to see if there is a noticeable difference between different algorithms

In [36]:
performance.extra["duration"]
Out[36]:
0.04615460000000127

Trying to run SVM without scaling the data has proven to be extremely slow (multiple minutes for the linear, 1-degree model), so we will not compare unscaled vs. scaled.

We'll just make a mental note to always scale the data.

We're going to skip the "linear" kernel, as it never seems to finish.

Sometimes, the data cannot be classified using a linear model

https://datascience.stackexchange.com/questions/9453/linear-kernel-in-svm-performing-much-worse-than-rbf-or-poly
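One standard way to bake that "always scale" mental note into the model itself is sklearn's Pipeline, so the scaler is fit on the training data only and applied consistently at prediction time (a sketch on synthetic data; the notebook's svm helper may do something equivalent):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The scaler is fit on the training fold only, inside the pipeline,
# so there is no leakage and no way to forget to scale at predict time.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
```

Wrapping the scaler this way also makes the speed problem disappear, since the SVM only ever sees standardised features.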

In [205]:
kernels = ["poly", "rbf", "sigmoid"]
degrees = [0, 1, 2, 3, 4, 5]
scaling = [True, False]
datasets = ["train", "test"]
In [208]:
m = [
    svm(clean, kernel=k, degree=d, scale=s, check_using=c)
    for k in kernels
    for d in degrees
    for s in scaling
    for c in datasets
]

metrics_svm = metrics_as_dataframe(m)
metrics_svm
Out[208]:
check_using degree scale kernel metric score
0 train 0 True poly accuracy 0.708622
1 test 0 True poly accuracy 0.731156
2 train 0 False poly accuracy 0.708622
3 test 0 False poly accuracy 0.731156
4 train 1 True poly accuracy 0.708622
... ... ... ... ... ... ...
283 test 4 False sigmoid f1_score 0.723478
284 train 5 True sigmoid f1_score 0.762527
285 test 5 True sigmoid f1_score 0.786207
286 train 5 False sigmoid f1_score 0.718818
287 test 5 False sigmoid f1_score 0.723478

288 rows × 6 columns

In [213]:
grid = sns.FacetGrid(data=metrics_svm, col="kernel", row="metric", ylim=(0.5, 1))
grid.map_dataframe(
    sns.scatterplot, x="degree", y="score", hue="check_using", style="scale"
)
grid.add_legend()
Out[213]:
<seaborn.axisgrid.FacetGrid at 0x200afe63d00>

We can conclude a few things, from these charts:

  • Regarding Kernels:
    • RBF seems to outperform poly and sigmoid (on test). However, if we were to check against train+scaled, we would not see this difference.
    • Sigmoid seems to consistently be the worst performing kernel (for this dataset)
    • "Linear" could not be used as it was taking multiple orders of magnitude longer to complete
  • Regarding degrees:
    • Only Poly seems to be noticeably affected by this parameter.
    • In Poly, the "3 degrees" configuration seems to present better performance than other degrees
  • Regarding Scaling:
    • Using StandardScaler to scale the data does seem to help with most kernels,
    • except RBF, which seems to perform better on unscaled data.
  • Regarding performance metrics:
    • Recall seems to consistently outperform other metrics (Accuracy, etc.. )

Ensemble Models¶

Now that we have explored all of the individual "simple" models, let's return to our original objective.

Remember that we want to maximize accuracy, but above that, our primary goal is to have great recall.

Let's try to build a few hundred ensemble models with different configurations, and see how we can find the optimal configurations that yield the best results.


Let's create an Ensemble model that is composed of a bunch of KNN models (cheap to train, but comparatively slow to predict).

These are the tweakable parameters of this ensemble model:

  • parameters for the inner model (KNN)
    • n_neighbors: parameter of the KNN models inside. # of neighbors to check in order to predict new values
  • parameters for the ensemble (composite):
    • n_estimators: how many KNN models should it encapsulate
    • max_samples: what proportion of the entire training set is it showing each of the KNN model (during training)
    • max_features: how many features does each KNN model get to see (we can train them on partial sets of features)

Ideally, we'd like to play with all of them, but visualizing 5 metrics (4 for performance + 1 for duration) in a 4-dimensional space will be hard!

So we might just run a small sample to decide which of the parameters impact the result the most.

In [40]:
performance: ModelPerformance = bagging_knn(
    clean,
    n_estimators=2,
    n_neighbors=5,
    max_samples=0.4,
    max_features=2,
)

print(performance.extra["duration"])
performance.confusion_plot()
4.061390299999999

Not terrible, not great

In [41]:
n_estimators = [1, 5, 10, 15, 20, 25, 50]
n_neighbors = [1, 2, 3]
max_samples = [0.1, 0.3, 0.5, 0.8, 1.0]
max_features = [1, 2, 3, 5]

bag_knn_performances = {
    (e, n, s, f): bagging_knn(
        clean,
        n_estimators=e,
        n_neighbors=n,
        max_samples=s,
        max_features=f,
    )
    for e in n_estimators
    for n in n_neighbors
    for s in max_samples
    for f in max_features
}
In [42]:
n_estimators = []
n_neighbors = []
max_samples = []
max_features = []
accuracies = []
precisions = []
recalls = []
f1s = []
durations = []

for model in bag_knn_performances.values():
    n_estimators.append(model.params["n_estimators"])
    n_neighbors.append(model.params["n_neighbors"])
    max_samples.append(model.params["max_samples"])
    max_features.append(model.params["max_features"])
    accuracies.append(model.accuracy())
    precisions.append(model.precision())
    recalls.append(model.recall())
    f1s.append(model.f1_score())
    durations.append(model.extra["duration"])
In [43]:
performances = pd.DataFrame(
    {
        "n_estimators": n_estimators,
        "n_neighbors": n_neighbors,
        "max_samples": max_samples,
        "max_features": max_features,
        "accuracies": accuracies,
        "precisions": precisions,
        "recalls": recalls,
        "f1s": f1s,
        "durations": durations,
    }
)
performances.sample(10)
Out[43]:
n_estimators n_neighbors max_samples max_features accuracies precisions recalls f1s durations
308 25.0 1.0 0.5 1.0 0.756281 0.750000 1.000000 0.857143 0.075150
75 5.0 1.0 0.8 5.0 0.886935 0.886792 0.969072 0.926108 0.032473
262 20.0 2.0 0.1 3.0 0.909548 0.899687 0.986254 0.940984 0.058104
245 20.0 1.0 0.3 2.0 0.889447 0.884735 0.975945 0.928105 0.065183
61 5.0 1.0 0.1 2.0 0.859296 0.863777 0.958763 0.908795 0.044321
217 15.0 2.0 1.0 2.0 0.902010 0.884146 0.996564 0.936995 0.059607
119 5.0 3.0 1.0 5.0 0.907035 0.909677 0.969072 0.938436 0.041478
329 25.0 2.0 0.5 2.0 0.821608 0.807263 0.993127 0.890601 0.091529
173 10.0 3.0 0.8 2.0 0.886935 0.868263 0.996564 0.928000 0.054871
27 1.0 2.0 0.3 5.0 0.889447 0.955720 0.890034 0.921708 0.006751
In [44]:
charts.pairplot_with_corr(performances)
Out[44]:
<seaborn.axisgrid.PairGrid at 0x21a3e818610>

It seems that recall is somewhat correlated with n_estimators.

Increasing all of the other meta-parameters also helps (positive correlation), but the number of estimators contributes the most to making the overall model converge on the correct prediction.

That does not mean we need to go overboard: given the simple dataset we have, a ton of different combinations ended up with perfect recall.

In [45]:
sns.histplot(performances["recalls"])
Out[45]:
<AxesSubplot: xlabel='recalls', ylabel='Count'>
In [46]:
top_recalls = performances[performances.recalls == 1.0].sort_values(
    by=["accuracies"], ascending=False
)
top_recalls.head(10)
Out[46]:
n_estimators n_neighbors max_samples max_features accuracies precisions recalls f1s durations
237 15.0 3.0 1.0 2.0 0.891960 0.871257 1.0 0.931200 0.075590
293 20.0 3.0 0.8 2.0 0.879397 0.858407 1.0 0.923810 0.065442
297 20.0 3.0 1.0 2.0 0.874372 0.853372 1.0 0.920886 0.080150
382 50.0 2.0 0.1 3.0 0.859296 0.838617 1.0 0.912226 0.090328
357 25.0 3.0 1.0 2.0 0.856784 0.836207 1.0 0.910798 0.073806
65 5.0 1.0 0.3 2.0 0.851759 0.831429 1.0 0.907956 0.038535
397 50.0 2.0 1.0 2.0 0.851759 0.831429 1.0 0.907956 0.106347
353 25.0 3.0 0.8 2.0 0.851759 0.831429 1.0 0.907956 0.070790
417 50.0 3.0 1.0 2.0 0.846734 0.826705 1.0 0.905132 0.099445
393 50.0 2.0 0.8 2.0 0.834171 0.815126 1.0 0.898148 0.101280
In [47]:
sns.stripplot(
    top_recalls,
    x="n_estimators",
    y="max_features",
    jitter=0.1,
    hue="accuracies",
)
plt.ylim(0)
Out[47]:
(0.0, 3.1)

All of these dots represent a model with excellent recall (1.0).

We are trying to see whether there are patterns that also result in great accuracy (dark dots).

After checking various combinations of data slicing and dicing (by setting the X and Y axes to different metrics), this chart is the only one that seems to show a pattern. Unsurprisingly, it tells us what we already suspected: more models with access to more features results in better accuracy (while keeping recall at its peak of 1.0).

When comparing the durations of these models, things don't seem to differ too much, but we're not going to generalise about speed: with such a small dataset, the difference in speed is probably impossible to notice in a significant way (is 0.02 seconds really a decision maker, when we spend minutes, hours or days munging the data manually?).

Custom loss function¶

Alright, so far we have explored some simple/composite models.

But none of these comparisons have helped us decide which way to go. We have found models with perfect recall, models with excellent precision, and models that always predicted "YES! Sell this person some travel insurance!"

But what we really want is an easy way to assess them all and see which one works best according to our criteria of "ideal performance".

Remember that our ideal performance was:

  1. Good recall (never miss a sale! if there's a small possibility of a sale, suggest it)
  2. Accuracy as high as possible.

We want to create a custom "loss function"/scorer that encodes this definition of success. It should penalise false negatives more strongly than it rewards true positives.

It will also ignore the other metrics, because we only care about them indirectly.

In [52]:
def missed_sales_are_painful(y_true, y_pred):
    # Each missed sale (false negative) costs 10x what a closed sale
    # (true positive) earns; false positives and true negatives are free.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (10 * fn) - tp


# Lower is better here, so let sklearn flip the sign with greater_is_better=False.
aggressive_salesman = make_scorer(missed_sales_are_painful, greater_is_better=False)
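As a quick sanity check, here is the raw loss evaluated on made-up toy labels (hypothetical, not from the dataset): missing a single sale outweighs several closed ones.

```python
from sklearn.metrics import confusion_matrix


def missed_sales_are_painful(y_true, y_pred):
    # Same scorer as defined in the cell above.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (10 * fn) - tp


# Toy labels: 4 buyers, 2 non-buyers.
y_true = [1, 1, 1, 1, 0, 0]

# Miss one sale: 3 TP, 1 FN -> 10*1 - 3 = 7 (bad; lower is better).
print(missed_sales_are_painful(y_true, [1, 1, 1, 0, 0, 0]))  # 7

# Catch every sale, even at the cost of one FP: 4 TP, 0 FN -> -4 (good).
print(missed_sales_are_painful(y_true, [1, 1, 1, 1, 1, 0]))  # -4
```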
In [53]:
performance = cross_validation_custom_score(
    clean, custom_scorer=aggressive_salesman, n_iter_search=10
)
best_params = performance.extra["best_params"]
best_params
Out[53]:
{'C': 0.8802345194777873, 'degree': 0, 'kernel': 'poly'}

This certainly matches our expectation.

If you remember, we had already found this to be an excellent model that guarantees great recall.

Considering how expensive we made recall errors, it's no surprise that our custom loss function picks the 0-degree polynomial kernel. The tiny variations in precision/accuracy do not really impact the final result.

In [54]:
svm_performance = svm(clean, kernel=best_params["kernel"], degree=best_params["degree"])
svm_performance.confusion_plot()

The problem with this model is that it effectively makes our policy "always suggest getting insurance".

The fundamental problem is that we made our loss function too aggressive at punishing missed sales... so we ended up with Alec Baldwin's character from Glengarry Glen Ross:

It seems that with our machine learning algorithm we have created a monster.

Let that be a lesson to us all: models only see the data we present to them, and optimization follows our loss function/utility function to the limit.
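The effect is easy to reproduce on toy labels (hypothetical, not from the dataset): under this loss, a constant always-yes predictor beats a more selective one, because false positives cost nothing while every false negative costs 10.

```python
from sklearn.metrics import confusion_matrix


def missed_sales_are_painful(y_true, y_pred):
    # Same scorer as above: lower is better.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (10 * fn) - tp


# Hypothetical labels: 7 buyers out of 10 customers.
y_true = [1] * 7 + [0] * 3

always_yes = [1] * 10            # the "monster": pitch everyone
reasonable = [1] * 6 + [0] * 4   # misses one buyer, annoys no one

print(missed_sales_are_painful(y_true, always_yes))   # 0 FN, 7 TP -> -7 (wins)
print(missed_sales_are_painful(y_true, reasonable))   # 1 FN, 6 TP -> 4
```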

A less aggressive custom function¶

As expected, this scoring function forces our model to be too aggressive, and we're worried that this might annoy some customers.

We want to give some penalty if our model is too annoying towards customers who are extremely unlikely to buy insurance.

We want to add a small penalty for those cases.

We will also flip greater_is_better to True, so our scoring is easier to read (higher score = better).

In [80]:
def missed_sales_are_painful_but_lets_not_annoy_customers(y_true, y_pred):
    # A missed sale (FN) cancels out a closed sale (TP), and pestering an
    # uninterested customer (FP) costs twice as much as a sale earns.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (tp - (1 * fn)) - (2 * fp)


# Higher is better this time, so no sign flip is needed.
less_annoying_salesman = make_scorer(
    missed_sales_are_painful_but_lets_not_annoy_customers, greater_is_better=True
)

This is the new company policy:

|  | did not offer insurance | offered insurance |
|---|---|---|
| **did not want insurance** | 😐 no one cares | 🛑 let's avoid it if they clearly DON'T want it |
| **wanted insurance** | ⚠️ no longer "TERRIBLE" but still bad | ✅ GREAT |
In [81]:
performance = cross_validation_custom_score(
    clean, custom_scorer=less_annoying_salesman, n_iter_search=10
)
best_params = performance.extra["best_params"]
best_params
Out[81]:
{'C': 0.6179572471872472, 'degree': 0, 'kernel': 'rbf'}
In [82]:
svm_performance = svm(clean, kernel=best_params["kernel"], degree=best_params["degree"])
svm_performance.confusion_plot()

This is clearly better... Now our model offers insurance only when it's reasonably confident that the customer wants it... and declines to offer it otherwise.

We have 40 customers who no longer hate us (because we are not annoying them with insurance they didn't really want), at only a small cost (4 customers did want it, but we did not offer it).

Keeping 40 loyal customers returning year after year is definitely worth this tiny cost. 👏

Good Job Support Vector Classifier!

Conclusions¶

In this analysis, we've used a Kaggle dataset about travel insurance policies to compare and contrast various machine learning algorithms.

The algorithms covered are:

  • simple models
    • KNNs
    • Decision Trees
    • Random Forests
    • SVMs
  • Composite/Ensemble models
    • which can be composed of other models and delegate to their consensus.

Executive Summary¶

Dataset¶

The dataset selected for this comparison analysis is almost perfect:

  • clean
  • no missing data
  • requires minimal data transformations
  • all data makes sense and there are no outliers

Models explored¶

  • K-Nearest Neighbors
    • performed relatively well, even with very naive parameters (k=1), which is more a testament to the dataset's simplicity and its clustered data than to the model itself.
    • Most of the best performance came with low Ks, and as K increased, all performance metrics started to degrade (understandable given the small size of the dataset)
  • Decision Trees
    • Performed similarly well: good accuracy and strong sustained recall as the max_depth increased.
  • Random Forests
    • Provided similar performance, and also allowed us to peek inside and see which features the aggregated model considered important for making predictions (Income and Age were the strongest drivers for acquiring travel insurance)
    • We noticed that importances always show positive scores, so we can only measure the strength of a contribution, not its direction (positive/negative) on the predicted result.
    • Due to a minor miscalculation, we also learned how large these models can be, and how long they take to store to disk (when you have accidentally created 300 of them)
    • Visualizing performance and tradeoffs became a bit harder because we had more parameters than before.
      • The factor that really influences accuracy, precision and recall is forest size (more than max_depth for the trees)
  • Support Vector Machines
    • Due to the complexity and flexibility of SVMs, we decided to try a few combinations only, instead of doing an exhaustive comparison.
    • We covered various functions/transformations (polynomial functions, radial basis functions and sigmoid functions)
    • We also explored the different performances over various degrees, and with/without previous scaling.
    • Interestingly, for this dataset:
      • Polynomial and Sigmoid functions benefitted the most from scaling data
      • while RBF got marginally worse performance when using scaled data
  • Ensemble Models
    • These allowed us to make richer composite models that leveraged multiple simpler models to achieve performance better than a single complex one.
    • The parameter that improved our desired metric (recall) the most was n_estimators.

Finally, we have also explored cross validation and using custom loss functions to be able to quantify and rate models' performances according to what we wish to optimize.

Instead of looping over combinations of parameters by hand, as in the rest of the exercise, we used RandomizedSearchCV to do the heavy lifting for us (searching for and selecting the optimal parameters for the ideal support vector machine)

Unfortunately, our custom loss function penalized false negatives so heavily that the selected model turned out to be a monster that recommends travel insurance to all of our clients, regardless of their attributes.

Once this was adjusted (by giving it a new scoring function to optimize), the model became more manageable and only recommended insurance to clients it was confident would be interested.

Let this be a reminder that the system will optimize whatever it is you reward, intentionally or accidentally.